In GPU acceleration, we must abandon the "compute-first" mindset. Modern performance is governed by memory management: the orchestration of data allocation, synchronization, and optimization between the host (CPU) and the device (GPU).
1. The Memory-Compute Gap
Although GPU arithmetic throughput (TFLOPS) has surged, memory bandwidth (GB/s) has grown far more slowly. The resulting gap leaves execution units frequently "starved," waiting for data to arrive from video memory (VRAM). For this reason, GPU programming is usually memory programming.
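To make the gap concrete, here is a quick back-of-the-envelope calculation; the peak throughput and bandwidth figures are illustrative round numbers, not any specific GPU's specification:

```python
# Machine-balance calculation with hypothetical round numbers,
# not a specific GPU's datasheet values.
peak_flops = 100e12      # 100 TFLOP/s of arithmetic throughput
bandwidth = 2e12         # 2 TB/s of memory bandwidth

# Machine balance: FLOPs the hardware can perform per byte it can fetch.
machine_balance = peak_flops / bandwidth

print(f"A kernel must perform {machine_balance:.0f} FLOPs per byte loaded "
      f"just to keep the execution units busy.")
```

Any kernel whose arithmetic intensity falls below that ratio will idle the compute units while waiting on memory.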
2. The Roofline Model
This model visualizes the relationship between arithmetic intensity (FLOPs/Byte) and performance. Applications typically fall into one of two regimes:
- Memory-bound: limited by bandwidth (the sloped part of the roof).
- Compute-bound: limited by peak TFLOPS (the flat ceiling).
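The two regimes can be sketched in a few lines of Python; the peak and bandwidth numbers below are hypothetical, and the intensity values are only rough stand-ins for real kernels:

```python
def roofline(intensity_flops_per_byte, peak_flops, bandwidth_bytes_per_s):
    """Attainable performance under the Roofline model:
    the lesser of the bandwidth slope and the compute ceiling."""
    return min(peak_flops, bandwidth_bytes_per_s * intensity_flops_per_byte)

# Hypothetical device: 100 TFLOP/s peak, 2 TB/s memory bandwidth.
PEAK, BW = 100e12, 2e12

# Low intensity (a streaming update, ~0.125 FLOPs/byte): memory-bound,
# attainable performance sits on the sloped part of the roof.
print(roofline(0.125, PEAK, BW) / 1e12, "TFLOP/s")  # 0.25 TFLOP/s

# High intensity (a large matrix multiply, ~200 FLOPs/byte): compute-bound,
# attainable performance hits the flat ceiling.
print(roofline(200, PEAK, BW) / 1e12, "TFLOP/s")    # 100.0 TFLOP/s
```

The crossover point (here 50 FLOPs/byte) is where the slope meets the ceiling: below it, only more data reuse helps; above it, only more arithmetic throughput does.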
3. The Cost of Data Movement
The primary performance bottleneck is rarely the arithmetic itself; it is the latency and energy cost of moving each byte across the PCIe bus or out of HBM. High-performance programs prioritize where data resides and minimize transfers between host and device.
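A rough comparison shows why a byte's route matters. The bandwidth figures are assumptions (roughly PCIe 4.0 x16 class for the interconnect, HBM-class for device memory), not measured values:

```python
# Back-of-the-envelope cost of moving 1 GB over PCIe versus reading it
# from on-device HBM. Bandwidth figures are illustrative assumptions.
data_bytes = 1e9

pcie_bw = 32e9    # host <-> device interconnect (~PCIe 4.0 x16)
hbm_bw = 2e12     # on-device memory

pcie_time = data_bytes / pcie_bw   # ~31 ms
hbm_time = data_bytes / hbm_bw     # ~0.5 ms

print(f"PCIe transfer: {pcie_time * 1e3:.1f} ms, "
      f"HBM read: {hbm_time * 1e3:.2f} ms, "
      f"ratio: {pcie_time / hbm_time:.1f}x")
```

Under these assumptions the same gigabyte costs over 60x more time crossing the bus than being read locally, which is why residence beats recomputation of transfers.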
QUESTION 1
What is the primary cause of a GPU kernel being 'memory-bound'?
- The clock speed of the GPU cores is too slow.
- The rate of data delivery is slower than the rate of arithmetic execution.
- There are too many threads running in parallel.
- The CPU is faster than the GPU.
✅ Correct: When data cannot be fed to execution units fast enough to keep them busy, the kernel is limited by memory bandwidth.
❌ Incorrect: Memory-bound refers specifically to the bandwidth bottleneck, not core clock speeds.

QUESTION 2
In the context of GPU programming, what does 'Memory Management' involve?
- Only allocating variables on the CPU stack.
- Controlling allocation, synchronization, and optimization of data transfer between host and device.
- Optimizing the cache size of the L1 controller.
- Manually cleaning the GPU registers after every kernel call.
✅ Correct: It is the strategic orchestration of data across the entire hardware hierarchy.
❌ Incorrect: Memory management in HIP/ROCm encompasses the movement and lifecycle of data between Host and Device.

QUESTION 3
Which axis of the Roofline Model represents 'Arithmetic Intensity'?
- Vertical Axis (Y)
- Horizontal Axis (X)
- The slope of the line.
- The area under the curve.
✅ Correct: The X-axis measures FLOPs per Byte, determining where an application sits relative to the bandwidth wall.
❌ Incorrect: The Y-axis represents performance (GFLOPS); the X-axis represents intensity.

QUESTION 4
Why is redundant host-device transfer considered a 'performance tax'?
- It consumes GPU registers.
- Latency and energy consumption of moving data across PCIe is significantly higher than instruction execution.
- It increases the floating-point precision error.
- It causes the GPU to overheat instantly.
✅ Correct: Data movement is often the most expensive operation in terms of both time and power.
❌ Incorrect: Data movement doesn't affect math precision; it affects performance and power efficiency.

QUESTION 5
If a researcher's kernel spends 95% of its time 'stalled,' what is the most likely culprit?
- The math instructions are too complex.
- Inefficient orchestration of data residence causing the GPU to wait for data.
- The GPU has too much VRAM.
- The kernel was written in C++ instead of Python.
✅ Correct: Stalls usually indicate the compute units are idle while waiting for high-latency memory transactions.
❌ Incorrect: Complex math would make a kernel compute-bound, not necessarily cause 95% idle stalls.

Case Study: The Climate Simulation Bottleneck
Optimizing a Fluid Dynamics Kernel
A research team is running a massive climate simulation. Their HIP kernel theoretically delivers high TFLOPS, but profiling shows the GPU spends 95% of its time stalled. The team currently transfers data from Host to Device at every time-step.
Q
Why does transferring data at every time-step likely cause the 95% stall?
Solution:
The PCIe bottleneck: The time taken to move data between Host RAM and Device VRAM via the interconnect is orders of magnitude slower than the kernel execution, forcing the GPU to wait (stall) for the next set of data.
Q
Based on the axiom 'GPU programming is memory programming,' what should the team's first optimization step be?
Solution:
Strategic orchestration of data residence: The team should keep data on the GPU across multiple time-steps and only transfer results back to the host when necessary, minimizing 'redundant' transfers.
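The recommendation above can be sketched as a simple timing model in Python (no real HIP calls; the step count and per-step costs are illustrative assumptions, not profiled numbers):

```python
# Timing model contrasting per-step host-device transfers with keeping
# data resident on the device. All costs are illustrative assumptions.
STEPS = 1000
KERNEL_MS = 0.5       # per-step kernel execution time (assumed)
TRANSFER_MS = 31.0    # per-step host<->device copy over PCIe (assumed)

# Naive pattern: copy data across PCIe before every time-step.
naive_ms = STEPS * (TRANSFER_MS + KERNEL_MS)

# Resident pattern: one initial upload, all steps run on-device,
# then a single download of the final result.
resident_ms = TRANSFER_MS + STEPS * KERNEL_MS + TRANSFER_MS

print(f"naive: {naive_ms:.0f} ms, resident: {resident_ms:.0f} ms")

# Fraction of the naive run spent waiting on transfers rather than computing.
stall_fraction = (STEPS * TRANSFER_MS) / naive_ms
print(f"time spent waiting on transfers: {stall_fraction:.0%}")
```

Under these assumed costs the naive pattern spends roughly 98% of its wall-clock time on transfers, which is consistent with the kind of stall profile the team observed.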